Using a Large Set of EAGLES-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging
Abstract
The paper presents one way of reconciling data sparseness with the requirement of high-accuracy tagging in terms of fine-grained tagsets. For lexicon encoding, EAGLES elaborated a set of recommendations aimed at covering multilingual requirements, which therefore resulted in a large number of features and possible values. Such an encoding, used for tagging purposes, would lead to very large tagsets. For instance, our EAGLES-compliant lexicon required a set of about 1000 morpho-syntactic description codes (MSDs) which, after considering some systematic syncretic phenomena, was reduced to a set of 614 MSDs. Building reliable language models (LMs) for this tagset would require unrealistically large amounts of hand-annotated or hand-validated training data. Our solution was to design a hidden reduced tagset and use it in building various LMs. The underlying tagger uses these LMs to tag a new text in as many variants as there are LMs available. The tag differences between these variants are processed by a combiner, which chooses the most likely tags. In the end, the tagged text is subject to a conversion process that maps the tags from the reduced tagset onto the more informative tags from the large tagset. We describe this processing chain and provide a detailed evaluation of the results.

Large tagsets and tiered tagging

The paper discusses experiments and results concerned with tagging highly inflectional languages, based on multiple register-diversified language models (LMs). The case study language is Romanian, for the tagset of which we adopted the internationally accepted set of EAGLES guidelines for morpho-syntactic encoding of lexica. The EAGLES-compliant Romanian lexicon was built within the MULTEXT-EAST Copernicus Joint Project, and the description of its almost half a million wordforms used a set of 614 morpho-syntactic description (MSD) codes. A full description of the encoding scheme we used is given in (Erjavec & Monachini, 1995).
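The processing chain outlined in the abstract (a hidden reduced tagset, several LMs, a combiner, and recovery of the full MSD) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the reduction rule (keeping the first two characters of an MSD), the toy lexicon and MSD strings, the unigram "LMs", the majority-vote combiner, and the lexicon-lookup recovery step.

```python
from collections import Counter

# Hypothetical reduction rule: keep only the first two MSD characters.
# The paper's actual reduced tagset is designed differently.
def reduce_msd(msd):
    return msd[:2]

# Toy lexicon (illustrative entries): wordform -> possible MSDs.
LEXICON = {
    "vin": {"Ncmsrn", "Vmip1s"},
    "mare": {"Afpfsrn", "Ncfsrn"},
}

def make_unigram_lm(tagged_corpus):
    """Stand-in for a real LM: per-word counts of reduced tags."""
    counts = {}
    for word, rtag in tagged_corpus:
        counts.setdefault(word, Counter())[rtag] += 1
    return counts

def tag_with_lm(lm, words):
    """Tag each word with its most frequent reduced tag under this LM."""
    tags = []
    for w in words:
        if w in lm:
            tags.append(lm[w].most_common(1)[0][0])
        else:
            # Unseen word: fall back to a reduced tag allowed by the lexicon.
            tags.append(sorted(reduce_msd(m) for m in LEXICON[w])[0])
    return tags

def combine(variants):
    """Assumed combiner: majority vote per token, ties go to the earliest LM."""
    combined = []
    for tags in zip(*variants):
        counts = Counter(tags)
        best = max(counts.values())
        combined.append(next(t for t in tags if counts[t] == best))
    return combined

def recover_msds(word, rtag):
    """Map a reduced tag back to the lexicon MSDs compatible with it."""
    return sorted(m for m in LEXICON[word] if reduce_msd(m) == rtag)

def tiered_tag(words, lms):
    variants = [tag_with_lm(lm, words) for lm in lms]
    reduced = combine(variants)
    return [(w, t, recover_msds(w, t)) for w, t in zip(words, reduced)]

# Three register-specific toy LMs that disagree on some tokens.
lm_news = make_unigram_lm([("vin", "Nc"), ("mare", "Af")])
lm_fiction = make_unigram_lm([("vin", "Vm"), ("mare", "Af")])
lm_law = make_unigram_lm([("vin", "Vm"), ("mare", "Nc")])

result = tiered_tag(["vin", "mare"], [lm_news, lm_fiction, lm_law])
```

In this sketch `recover_msds` returns a list because a reduced tag may remain compatible with more than one MSD for a given wordform; how such residual ambiguity is resolved is not shown here.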
Multilingual content analyses of the MULTEXT-EAST lexica and corpora are
Similar resources
Tiered Tagging and Combined Language Models Classifiers
We address the problem of morpho-syntactic disambiguation of arbitrary texts in a highly inflectional natural language. We use a large tagset (615 tags), EAGLES and MULTEXT compliant [5]. The large tagset is internally mapped onto a reduced one (82 tags), serving statistical disambiguation, and a text disambiguated in terms of this tagset is subsequently subject to a recovery process of all the i...
Large tagset labeling using Feed Forward Neural Networks. Case study on Romanian Language
Standard methods for part-of-speech tagging suffer from data sparseness when used on highly inflectional languages (which require large lexical tagset inventories). For this reason, a number of alternative methods have been proposed over the years. One of the most successful methods used for this task, called Tiered Tagging (Tufiş, 1999), exploits a reduced set of tags derived by removing severa...
Developping Tools and Building Linguistic Resources for Vietnamese Morpho-syntactic Processing
Vietnamese is spoken by about 80 million people around the world, yet very few concrete works on this language have been noticed in Natural Language Processing (NLP) until now. The fundamental problems in automatic analysis of Vietnamese, such as part-of-speech (POS) tagging, parsing, etc. are extremely difficult due to the lack of formal linguistic knowledge on one hand, and the specificities ...
Morpho-syntactic ambiguity and tagset design for Hungarian
The paper reports on work in progress to develop a tag set for Hungarian. The rich morphological structure of the language makes tagging feasible only after a full-scale morphological analysis, which results in a magnitude of patterns that do not easily translate into a corpus tag set of manageable size. The paper analyses the extent and types of morpho-syntactic ambiguity found in a 21m word s...